Okkhor: A Synthetic Corpus of Bangla Printed Characters

Mridul Banik, Md Jamiur Rahman Rifat, Jebun Nahar, Nazmul Hasan, Fuad Rahman
Accepted to be presented at FTC 2020 - Future Technologies Conference 2020, 5-6 November 2020, Vancouver, Canada

Description

Bangla is the fifth most-spoken native language in the world. Despite this large number of speakers, resources for developing language processing solutions are very limited. To realize the full potential of Machine Learning (ML) and Artificial Intelligence (AI) solutions for computer vision and Natural Language Processing (NLP), a complete, standardized, and fully annotated corpus is an essential prerequisite. Specifically, the development of Optical Character Recognition (OCR) systems for printed characters, an important resource for language automation and digitization, requires a large corpus with high coverage and font variability that represents the nuances of language usage. No such corpus exists for Bangla. In this paper, we present a novel synthetic corpus of over 5 million printed Bangla characters containing 60 alphanumeric characters, 10 vowel modifiers, and 159 compound characters, which together correspond to 229 different classes in both Unicode and ASCII encodings. This is entirely novel work, since no such corpus currently exists for the Bangla language.
